A Survey of Recoverable Distributed Shared Memory Systems

Authors

  • Christine Morin
  • Isabelle Puaut
Abstract

Distributed Shared Memory (DSM) systems provide a shared memory abstraction on distributed memory architectures (distributed memory multicomputers, networks of workstations). Such systems ease parallel application programming, since the shared memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSM system increases with the number of sites. Thus, fault tolerance mechanisms must be implemented to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSM (RDSM) systems, which provide a checkpointing mechanism to restart parallel computations after a site failure.

Une synthèse des systèmes à mémoire virtuelle partagée recouvrables

Résumé : Les systèmes à mémoire virtuelle partagée offrent à leurs utilisateurs l'illusion d'une mémoire partagée sur les architectures à mémoire distribuée (réseaux de stations de travail, machines parallèles à mémoire distribuée). De tels systèmes facilitent la programmation des applications parallèles, car le modèle de programmation par partage de mémoire est souvent plus naturel que le modèle de programmation par échange de messages. Toutefois, plus le nombre de composants dans un système à mémoire virtuelle partagée augmente, plus la probabilité qu'une défaillance se produise est importante. Des mécanismes de tolérance aux fautes doivent par conséquent être ajoutés aux systèmes à mémoire virtuelle partagée. Ce rapport effectue un tour d'horizon des mécanismes de sauvegarde et restauration de points de reprise dans les systèmes à mémoire virtuelle partagée (mémoires virtuelles partagées recouvrables). Ces mécanismes permettent de poursuivre l'exécution d'une application parallèle en dépit de la défaillance d'un site.

Similar resources

A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability

Large-scale distributed systems are very attractive for executing parallel applications that require huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recover...

Research on Adaptive and Recoverable Distributed Shared Memory

Software distributed shared memory (DSM) systems have many advantages over message-passing systems. Since DSM provides the user with a simple shared memory abstraction, the user does not have to be concerned with data movement between hosts. Many applications programmed for a shared memory multiprocessor system can be executed on a software DSM system without significant modifications. This paper...

An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency

This paper presents a coherence protocol for recoverable Distributed Shared Memory (DSM) systems with causally consistent read-write objects. It uses independent checkpointing tightly integrated with coherence operations. That integration results in high availability of shared objects and ensures fast restoration of a consistent DSM state in spite of multiple node failures, introducing lit...

Architectural Issues in Adopting Distributed Shared Memory for Distributed Object Management Systems

Distributed shared memory (DSM) provides a transparent network interface based on the memory abstraction. Furthermore, DSM offers ease of programming and portability. The advantages offered by DSM also include low network overhead, with no explicit operating system intervention to move data over the network. With the advent of high-bandwidth networks and wide addressing, adopting DSM for distrib...

Replication for Efficiency and Fault Tolerance in a DSM System

Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...

Publication date: 1995